Investigating the Failure Modes of the AUC metric and Exploring Alternatives for Evaluating Systems in Safety Critical Applications
With the increasing importance of safety requirements associated with the use
of black box models, evaluating the selective answering capability of models has
become critical. Area under the curve (AUC) is used as a metric for this purpose.
We find limitations in AUC; e.g., a model having higher AUC is not always
better in performing selective answering. We propose three alternate metrics
that fix the identified limitations. On experimenting with ten models, our
results using the new metrics show that newer and larger pre-trained models do
not necessarily show better performance in selective answering. We hope our
insights will help develop better models tailored for safety-critical
applications.
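As a rough illustration of the kind of limitation described above, the sketch below computes the area under an accuracy-versus-coverage curve for two toy models. The abstract does not say which curve the AUC refers to, so the accuracy-coverage curve, the function names, and the toy data are all assumptions made for illustration; this is not the authors' evaluation code, and the limitation shown may differ from the ones the paper identifies.

```python
from typing import List, Sequence, Tuple


def accuracy_coverage_curve(
    confidences: Sequence[float], correct: Sequence[int]
) -> List[Tuple[float, float]]:
    """Return (coverage, accuracy) points, answering the most confident items first.

    Assumption: selective answering is modeled by thresholding on confidence.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    points = []
    hits = 0
    for k, idx in enumerate(order, start=1):
        hits += int(correct[idx])
        points.append((k / len(order), hits / k))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a piecewise-linear curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: four questions, identical confidences for both models.
confidences = [0.9, 0.8, 0.7, 0.6]
correct_a = [0, 1, 1, 1]  # model A: wrong on its most confident answer
correct_b = [1, 1, 0, 0]  # model B: right only on its two most confident answers

curve_a = accuracy_coverage_curve(confidences, correct_a)
curve_b = accuracy_coverage_curve(confidences, correct_b)
auc_a = area_under_curve(curve_a)
auc_b = area_under_curve(curve_b)

print(f"AUC A = {auc_a:.3f}, accuracy at full coverage = {curve_a[-1][1]:.2f}")
print(f"AUC B = {auc_b:.3f}, accuracy at full coverage = {curve_b[-1][1]:.2f}")
```

In this toy case model B has the larger area, yet model A is more accurate when every question must be answered, so a single AUC value hides which model is preferable at a given operating point.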
PMU Tracker: A Visualization Platform for Epicentric Event Propagation Analysis in the Power Grid
The electrical power grid is a critical infrastructure, with disruptions in
transmission having severe repercussions on daily activities, across multiple
sectors. To identify, prevent, and mitigate such events, power grids are being
refurbished as 'smart' systems that include the widespread deployment of
GPS-enabled phasor measurement units (PMUs). PMUs provide fast, precise, and
time-synchronized measurements of voltage and current, enabling real-time
wide-area monitoring and control. However, the potential benefits of PMUs, for
analyzing grid events like abnormal power oscillations and load fluctuations,
are hindered by the fact that these sensors produce large, concurrent volumes
of noisy data. In this paper, we describe working with power grid engineers to
investigate how this problem can be addressed from a visual analytics
perspective. As a result, we have developed PMU Tracker, an event localization
tool that supports power grid operators in visually analyzing and identifying
power grid events and tracking their propagation through the power grid's
network. As a part of the PMU Tracker interface, we develop a novel
visualization technique which we term an epicentric cluster dendrogram, which
allows operators to analyze the effects of an event as it propagates outwards
from a source location. We robustly validate PMU Tracker with: (1) a usage
scenario demonstrating how PMU Tracker can be used to analyze anomalous grid
events, and (2) case studies with power grid operators using a real-world
interconnection dataset. Our results indicate that PMU Tracker effectively
supports the analysis of power grid events; we also demonstrate and discuss how
PMU Tracker's visual analytics approach can be generalized to other domains
composed of time-varying networks with epicentric event characteristics.
Comment: 10 pages, 5 figures, IEEE VIS 2022 Paper to appear in IEEE TVCG;
conference encourages arXiv submission for accessibility
Image or Information? Examining the Nature and Impact of Visualization Perceptual Classification
How do people internalize visualizations: as images or information? In this
study, we investigate the nature of internalization for visualizations (i.e.,
how the mind encodes visualizations in memory) and how memory encoding affects
its retrieval. This exploratory work examines the influence of various design
elements on a user's perception of a chart. Specifically, which design elements
lead to perceptions of visualization as an image or as information?
Understanding how design elements contribute to viewers perceiving a
visualization more as an image or information will help designers decide which
elements to include to achieve their communication goals. For this study, we
annotated 500 visualizations and analyzed the responses of 250 online
participants, who rated the visualizations on a bilinear scale as image or
information. We then conducted an in-person study (n = 101) using a free recall
task to examine how the image/information ratings and design elements impact
memory. The results revealed several interesting findings: Image-rated
visualizations were perceived as more aesthetically appealing, enjoyable, and
pleasing. Information-rated visualizations were perceived as less difficult to
understand and more aesthetically likable and nice, though participants
expressed higher positive sentiment when viewing image-rated visualizations and
felt less guided to a conclusion. We also found different patterns among
older participants. Importantly, we show that visualizations
internalized as images are less effective in conveying trends and messages,
though they elicit a more positive emotional judgment, while informative
visualizations exhibit annotation focused recall and elicit a more positive
design judgment. We discuss the implications of this dissociation between
aesthetic pleasure and perceived ease of use in visualization design.
Comment: 11 pages, 10 figures, 3 tables, accepted at IEEE Vis 202
LINGO : Visually Debiasing Natural Language Instructions to Support Task Diversity
Cross-task generalization is a significant outcome that defines mastery in
natural language understanding. Humans show a remarkable aptitude for this, and
can solve many different types of tasks, given definitions in the form of
textual instructions and a small set of examples. Recent work with pre-trained
language models mimics this learning style: users can define and exemplify a
task for the model to attempt as a series of natural language prompts or
instructions. While prompting approaches have led to higher cross-task
generalization compared to traditional supervised learning, analyzing 'bias' in
the task instructions given to the model is a difficult problem, and has thus
been relatively unexplored. For instance, are we truly modeling a task, or are
we modeling a user's instructions? To help investigate this, we develop LINGO,
a novel visual analytics interface that supports an effective, task-driven
workflow to (1) help identify bias in natural language task instructions, (2)
alter (or create) task instructions to reduce bias, and (3) evaluate
pre-trained model performance on debiased task instructions. To robustly
evaluate LINGO, we conduct a user study with both novice and expert instruction
creators, over a dataset of 1,616 linguistic tasks and their natural language
instructions, spanning 55 different languages. For both user groups, LINGO
promotes the creation of tasks that are more difficult for pre-trained models and
that contain higher linguistic diversity and lower instruction bias. We additionally
discuss how the insights learned in developing and evaluating LINGO can aid in
the design of future dashboards that aim to minimize the effort involved in
prompt creation across multiple domains.
Comment: 13 pages, 6 figures, Eurovis 202
PromptAid: Prompt Exploration, Perturbation, Testing and Iteration using Visual Analytics for Large Language Models
Large Language Models (LLMs) have gained widespread popularity due to their
ability to perform ad-hoc Natural Language Processing (NLP) tasks with a simple
natural language prompt. Part of the appeal for LLMs is their approachability
to the general public, including individuals with no prior technical experience
in NLP techniques. However, natural language prompts can vary significantly in
terms of their linguistic structure, context, and other semantics. Modifying
one or more of these aspects can result in significant differences in task
performance. Non-expert users may find it challenging to identify the changes
needed to improve a prompt, especially when they lack domain-specific knowledge
or appropriate feedback. To address this challenge, we present PromptAid,
a visual analytics system designed to interactively create, refine, and test
prompts through exploration, perturbation, testing, and iteration. PromptAid
uses multiple, coordinated visualizations which allow users to improve prompts
using three strategies: keyword perturbations, paraphrasing
perturbations, and obtaining the best set of in-context few-shot examples.
PromptAid was designed through an iterative prototyping process involving NLP
experts and was evaluated through quantitative and qualitative assessments for
LLMs. Our findings indicate that PromptAid helps users to iterate over prompt
template alterations with less cognitive overhead, generate diverse prompts
with the help of recommendations, and analyze the performance of the generated
prompts while surpassing existing state-of-the-art prompting interfaces in
performance.
How Robust are Model Rankings : A Leaderboard Customization Approach for Equitable Evaluation
Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: Are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their 'difficulty' level. We find that leaderboards can be adversarially attacked and that top-performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance, thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in the selection of a model best suited for their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41% on average.
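The idea of weighting samples by difficulty can be sketched as a weighted accuracy. The abstract does not define how difficulty is measured or how weights are assigned, so the `weighted_accuracy` helper, the difficulty weights, and the two toy models below are hypothetical and only illustrate how a ranking can change; they are not the paper's metric.

```python
from typing import Sequence


def weighted_accuracy(correct: Sequence[int], weights: Sequence[float]) -> float:
    """Accuracy in which each sample counts in proportion to its weight."""
    if len(correct) != len(weights):
        raise ValueError("correct and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * c for w, c in zip(weights, correct)) / total


# Hypothetical difficulty weights: four easy samples, two hard ones.
difficulty = [1, 1, 1, 1, 3, 3]
uniform = [1] * len(difficulty)

model_x = [1, 1, 1, 1, 0, 0]  # solves every easy sample, fails the hard ones
model_y = [1, 0, 0, 0, 1, 1]  # solves the hard samples, misses most easy ones

plain_x = weighted_accuracy(model_x, uniform)
plain_y = weighted_accuracy(model_y, uniform)
hard_x = weighted_accuracy(model_x, difficulty)
hard_y = weighted_accuracy(model_y, difficulty)

print(f"unweighted: X = {plain_x:.2f}, Y = {plain_y:.2f}")
print(f"difficulty-weighted: X = {hard_x:.2f}, Y = {hard_y:.2f}")
```

With uniform weights model X ranks first, while the difficulty weights put model Y ahead, which is the kind of ranking change the abstract reports.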
Localized ridge defect augmentation using human pericardium membrane and demineralized bone matrix
Background: The patient wanted to restore her lost teeth with implants in the lower left first molar and second premolar region. Cone beam computerized tomography (CBCT) revealed inadequate bone width and height around the future implant sites. The extraction socket in the second premolar area showed inadequate healing with sparse bone fill 4 months after extraction.
Aim: To evaluate the clinical feasibility of using a collagen physical resorbable barrier made of human pericardium (HP) to augment localized alveolar ridge defects for the subsequent placement of dental implants.
Materials and Methods: Ridge augmentation was done in the compromised area using Puros® demineralized bone matrix (DBM) Putty with chips and an HP allograft membrane. Horizontal (width) and vertical hard tissue measurements with CBCT were recorded on the day of ridge augmentation surgery and at the 4-month and 7-month follow-ups. An intraoral periapical radiograph taken 1 year after implant installation showed minimal crestal bone loss.
Results: Guided bone regeneration achieved a bone gain of 4.8 mm horizontally (width) and 6.8 mm vertically in the deficient ridge within 7 months of the procedure.
Conclusion and Clinical Implications: The results suggest that an HP allograft membrane may be a suitable component for augmentation of localized alveolar ridge defects in conjunction with DBM with bone chips.